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Abstract 

Stochastic variational inference allows for fast 
posterior inference in complex Bayesian mod¬ 
els. However, the algorithm is prone to local 
optima which can make the quality of the poste¬ 
rior approximation sensitive to the choice of hy¬ 
perparameters and initialization. We address this 
problem by replacing the natural gradient step of 
stochastic varitional inference with a trust-region 
update. We show that this leads to generally bet¬ 
ter results and reduced sensitivity to hyperparam¬ 
eters. We also describe a new strategy for varia¬ 
tional inference on streaming data and show that 
here our trust-region method is crucial for getting 
good performance. 

1. Introduction 

Stochastic variational inference (SVI; Hoffman et al., 
2013) has enabled variational inference on massive datasets 
for a large class of complex Bayesian models. It has been 
applied to, for example, topic models (Hoffman et al., 
2010), nonparametric models (Wang et al., 2011; Paisley 
et al., 2012), mixture models (Hughes & Sudderth, 2013), 
and matrix factorizations (Gopalan et al., 2014). How¬ 
ever, it has been observed that SVI can be sensitive to 
the choice of hyperparameters and is prone to local op¬ 
tima (Ranganath et al., 2013; Hughes & Sudderth, 2013; 
Hoffman & Blei, 2014). Successful attempts to improve 
its performance include the automatic tuning of learning 
rates (Ranganath et al., 2013) and variance reduction tech¬ 
niques (Mandt & Blei, 2014). The results of Hoffman & 
Blei (2014) suggest that local optima in the objective func¬ 
tion caused by a mean-field approximation (Wainwright & 
Jordan, 2008) can be eliminated by means of more complex 
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families of approximating distributions. They also find that 
the standard search heuristics used in variational inference 
consistently fail to find the best local optima of the mean- 
field objective function, suggesting that variational infer¬ 
ence algorithms could be improved by employing more ro¬ 
bust optimization algorithms. In the first part of this paper 
we propose a new learning algorithm that replaces the nat¬ 
ural gradient steps at the core of SVI with trust-region up¬ 
dates. We show that these trust-region updates are feasible 
in practice and lead to a more robust algorithm generally 
yielding better performance. 

In the second part of the paper we study the setting of 
continuous streams of data. SVI is a promising candidate 
for fitting Bayesian models to data streams, yet the depen¬ 
dence of its updates on the dataset size and its underlying 
assumption of uniform sampling from the entire dataset 
have hampered its application in practice. We therefore 
propose a new strategy for applying SVI in the streaming 
setting. We show that this strategy fails when used with 
natural gradient steps but works well with our trust-region 
method. When applied to latent Drrichlet allocation (LDA; 
Blei et al., 2003), our method is able to continuously in¬ 
tegrate new data points not only at the document level but 
also at the word level, which existing methods for varia¬ 
tional inference on streaming data cannot do. 

2. A trust-region method for stochastic 
variational inference 

In the following section, after introducing some notation 
and outlining the model class studied in this paper, we 
briefly review stochastic variational inference (SVI) in or¬ 
der to facilitate comparison with our trust-region extension. 

2.1. Model assumptions 

Our basic model assumptions are summarized by' 
p(/3) = /i(/3) exp t{l3) - a{r})^ , 


( 1 ) 
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Figure 1. A graphical model representation of the model class 
considered in this paper. 


p(x, Z I /3) — /i(Xn, Zn) exp (i(/3)^/(Xn,Z„)j . (2) 
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perform poorly due to local optima rather than the inflex¬ 
ibility of the approximating family (see also Wainwright 
& Jordan, 2008). We therefore focus our attention on the 
mean-field approximation in the following. 

2.3. Stochastic variational inference 

Extending the work by Sato (2001), Hoffman et al. (2013) 
have shown that mean-field variational inference is equiva¬ 
lent to natural gradient ascent on 

£(A) = max,^ £(A, (/)). (6) 


That is, we have global parameters f3 whose prior distribu¬ 
tion is an exponential family governed by natural parame¬ 
ters T), and N conditionally independent pairs of local pa¬ 
rameters Zn and observations x„ whose exponential-family 
distribution is controlled by /3 (Figure 1). While the basic 
strategy presented in this paper might also be applied in the 
non-conjugate case, for simplicity we assume conjugacy 
between the two exponential-family distributions. 

Instances of this model class considered in greater detail in 
this paper are latent Dirichlet allocation (Blei et al., 2003) 
and mixture models. Other instances include (but are not 
limited to) hidden Markov models, probabilistic matrix fac¬ 
torizations, and hierarchical linear and probit regression 
(Hoffman & Blei, 2014). 


2.2. Mean-field variational inference 


Mean-field variational inference approximates the posterior 
distribution over latent variables with a factorial distribu¬ 
tion, which here we take to be of the form 

g(/3, z) = q{f3) q{z„m), (3) 

n,m 

q{f3) = h{^) exp {x^t{P) - a(A)^ . (4) 

We further assume that q{znm) is controlled by parameters 
4>nm- Inference proceeds by altematingly updating each 
factor to maximize the evidence lower bound (ELBO), 


/1(A, 0) — Eg 
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P(/3) 


N 

+ ^Eg 
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(5) 


which is equivalent to minimizing the Kullback-Leibler di¬ 
vergence between q and the true posterior distribution over 
13 and z. While the trust-region method described below 
might also be applied to more complex approximations, we 
are particularly interested to see how much a mean-field 
approximation can be improved via better learning algo¬ 
rithms. As mentioned above, the results of Hoffman & Blei 
(2014) suggest that mean-field approximations might often 


This interpretation enables the derivation of stochastic nat¬ 
ural gradient algorithms for variational inference, in this 
context called stochastic variational inference (SVI). 

For a uniformly at random selected data point n. 
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represents an unbiased stochastic approximation of the 
ELBO given in Equation 5. Similarly, 


En(A) — En(A, </)„), (8) 

4>n = argmax^^ £„(A, 0„) (9) 

represents an unbiased estimate of the lower bound in 
Equation 6. For simplicity, in the following we only 
consider stochastic approximations based on a single data 
point. An extension to batches of multiple data points is 
straightforward. A step in the direction of the natural gra¬ 
dient of jCn(A) scaled by a learning rate of pt is given by 
(Hoffman et al., 2013) 

At+i = (1 - pt)At +pt (^V + [/(xn, z„)] j . (10) 


2.4. Trust-region updates 

As has been observed before (Hughes & Sudderth, 2013; 
Hoffman & Blei, 2014) and as we will corroborate further 
in Section 4, SVI can be prone to local optima. One possi¬ 
ble solution to this problem is to use a more flexible fam¬ 
ily of approximating distributions (Hoffman & Blei, 2014), 
effectively smoothing the objective function but sacrificing 
the speed and convenience of a mean-field approximation. 
In contrast, here we try to address the issue of local optima 
by improving on the optimization algorithm. 

Instead of performing a natural gradient step (Equation 10), 
we propose the following stochastic update; 


At+i =argmax;^ {£„(A) - >^t)} • (H) 


*The log-normalizer of the likelihood p(x, z | /3) is absorbed 
into t(/3) and f{xn., z„) to simplify the notation. 


Intuitively, this trust-region step tries to find the optimal 
solution to a stochastic approximation of the lower bound. 
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Algorithm 1 Trust-region SVI 
Set t = 0, initialize Aq randomly 

repeat 

Select Xn uniformly from the dataset 
Initialize 

repeat 

A (1 - pt)At + Pt (^T] + NE^* lf(x„, Zn)]j 
^ argmax^^r„(A,</>„) 
until convergence 
At+i -s— A 
t-i^t + 1 
until convergence 


regularized to prevent the distribution over parameters from 
changing too much. The degree of change between distri¬ 
butions is measured by the Kullback-Leibler (KL) diver¬ 
gence T>kl^ and the parameter > 0 controls the strength 
of the regularization. If we could solve Equation 11 in 
closed form, this algorithm would at least no longer be 
hampered by local optima in £n. In general this will not 
be possible so that we have to resort to the following opti¬ 
mization scheme. 

For fixed <pn, the natural gradient of the objective on the 
right-hand side of Equation 11 is given by 

V + iVE^. [/(x„, Zn)] - A + Ci(At - A). (12) 

This follows from the the natural gradient of £n (Hoffman 
et al., 2013) and the fact that the gradient of the Kullback- 
Leibler divergence between two distributions from an exp- 
nential family in canonical form is given by 

VAf3KL(A,A') =/(A)(A-A'), (13) 

where J(A) is the Fisher information matrix (see Supple¬ 
mentary Section 1 for a derivation). Setting the gradient to 
zero yields 

A= (1 - pt)At -I-pt (j] + NE^^[f{Xn,Zn)]^ , (14) 

defining pt = (1 -I- We propose to solve Equation 11 

via alternating coordinate ascent, that is, by altematingly 
computing cj)n (Equation 9) and A (Equation 14) until some 
stopping criterion is reached. Note that natural gradient 
ascent can be seen as a special case of this trust-region 
method where A is initialized with At and only one iter¬ 
ation of updates to (pn and A is performed. 

The need to iteratively optimize A makes each trust-region 
step more costly than a simple natural gradient step. How¬ 
ever, the additional overhead is often smaller than one 
might expect. For many models (such as EDA), the 
dominant cost is solving the sub-problem of computing 
argmax^^£(A, which must be done iteratively. This 


sub-problem can be initialized with the previous value of 
4>n', when A is near convergence, this initialization will al¬ 
ready be near a (local) optimum, and the sub-problem can 
be solved quickly. 

In the limit of large the solution to Equation 11 will be¬ 
come identical to the natural gradient step in Equation 10 
(see Supplementary Section 2). Consequently, the conver¬ 
gence guarantees that hold for SVI carry over to our trust 
region method. That is, under certain regularity assump¬ 
tions and for appropriately decaying p* (Bottou, 1998), it¬ 
eratively applying Equation 11 will converge to a local op¬ 
timum of £. 

For any finite ^t, the two update steps will generally be 
different. A crucial advantage of the trust-region method is 
that in each iteration we can initialize A and cpn arbitrarily 
before applying the alternating optimization scheme. This 
way we can hope to jump out of local optima. The optimal 
initialization will depend on the data and the specific model 
being used. However, in our experiments with EDA and 
mixture models we found that a generally useful strategy 
is to initialize cpn such that the beliefs over z„ are uniform 
and to compute the initial A with these beliefs. 

The general algorithm is summarized in Algorithm 1 . More 
detailed derivations for EDA and mixture models can be 
found in the supplementary material. 

2.5. Related work 

Our optimization resembles mirror descent (Nemirovski & 
Yudin, 1983; Beck & Teboulle, 2003), which applied to 
the lower bound (Equation 6) corresponds to updates of the 
form 

At+i = argmax;^ {{VA-Cn(At), A - At) - At)} (15) 

for some Bregman divergence B. The main difference to 
our algorithm is that we try to optimize £n exactly instead 
of a first-order approximation. 

Trust-region methods have a long history in optimization 
(Nocedal & Wright, 1999) and are frequently brought to 
bear on machine learning problems (e.g., Lin et al., 2007; 
Pascanu et al., 2014). However, we are unaware of any 
other trust-region method which omits a local approxima¬ 
tion (Equation 11). 

3. Streaming 

Despite its sequential nature, SVI has not found widespread 
use in the streaming setting. One reason is that it assumes 
that the dataset is fixed and its size, N, known in advance. 
In contrast, streaming data only becomes available over 
time and is potentially infinite. Another disadvantage of 
a naive application of SVI to the streaming setting where 
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data points are processed once as they arrive is that it might 
not make the best use of the available computational re¬ 
sources. In real world applications, data points rarely ar¬ 
rive at a constant rate. Processing data points as they arrive 
thus means that there will be times when a computer has to 
quickly process many data points and other times where it 
will be idle. 


et al., 2013; Tank et al., 2014). Unlike SVI, its updates are 
independent of the dataset size. The basic idea behind SVB 
is that of assumed density filtering (e.g., Maybeck, 1982; 
Minka, 2001): the posterior is updated one data point at a 
time using Bayes rule and approximated between each two 
updates. Applied to our model class and ignoring the local 
variables for a moment, an update can be summarized as 


In the following, we propose an alternative but similarly 
straightforward application of SVI to the streaming setting. 
As we will show in Section 4, this algorithm gives poor 
results with natural gradient steps as local optima and sen¬ 
sitivity to hyperparameters are particularly problematic in 
the streaming setting. However, we find that it performs 
well using our trust-region updates. 


3.1. Streaming SVI 

Rather than processing a data point once when it arrives, we 
suggest continuously optimizing an evolving lower bound. 
Instead of updating parameters directly, each new data 
point is simply added to a database. At time t, we optimize 
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(16) 


where A* is the number of observed data points, by select¬ 
ing a data point (or batch of data points) from the database 
uniformly at random and performing either a natural gradi¬ 
ent or trust-region update. 

Learning algorithms whose performance is robust to 
changes in hyperparameter settings are especially impor¬ 
tant in the streaming setting where performing test runs and 
cross-validation is often not an option. In addition to using 
trust-region updates we therefore employ empirical Bayes. 
After updating A by performing a trust-region step (Equa¬ 
tion 11 ), we update t] by performing one stochastic step in 
the direction of the natural gradient of the lower bound, 

rjt+i ^rjt + Pt [tm - Erit [t(/3)l) • (17) 


Pn+l{f3) CXp{Xn\ f3)qx„{0), (19) 

A„+i = argmin^^ Ukl [qx{(3) \\ Pn+i{(3)], (20) 

where Aq is set to rj. 

3.3. Streaming batch algorithm 

We will compare our streaming SVI algorithm and SVB to 
the following simple baseline algorithm. At each iteration, 
we randomly select a large batch of min(B, Nt) data points 
from the database, where Nt is the current number of data 
points in the database. We then run a batch algorithm to 
perform mean-field variational inference for a fixed num¬ 
ber of iterations (or until it converges). Once the train¬ 
ing is complete, we replace the current parameters with the 
parameters found by the batch algorithm and immediately 
restart the training on a newly sampled dataset of size B. 

4. Experiments 

In the following we demonstrate the usefulness of our pro¬ 
posed modifications to SVI. We start with a toy example 
and continue with increasingly realistic applications. 

4.1. MNIST 

As a first illustrative example, consider a mixture of multi¬ 
variate Bernoulli distributions with K components, where 
the global parameters are given by component parameters 
/9/j € [0,1]^ and mixture weights n. We assume uniform 
prior distributions on [3^. and tt and beta and Dirichlet dis¬ 
tributions for the approximate posterior distributions. 


To make the learning algorithm and hyperparameters less 
dependent on the speed with which new data points arrive, 
we use the following learning rate schedule, 

(18) 

where B is the batch size used. Instead of coupling the 
learning rate to the number of updates to the model, this 
schedule couples it to the number of data points added to 
the database since the start of learning, Nt — Nq. 

3.2. Streaming variational Bayes 

Streaming variational Bayes (SVB) has been proposed as 
an alternative to SVI for the streaming setting (Broderick 


a(/3)« rife,* /3h‘ ^^ 

Using 40 components, we applied this model to a binarized 
version of MNIST where each pixel has been randomly set 
to 0 or 1 according to its grayscale value interpreted as a 
probability. The and 6 ^., were randomly initialized by 
sampling from a gamma distribution with shape parameter 
100 and scale parameter 0 . 01 , and the 7 ^ were initialized 
at 1. We ran SVI with trust-region updates for 10 epochs 
(passes through the dataset) using 2 iterations for the inner 
loop (Algorithm 1) and SVI with natural gradient steps for 
20 epochs. For both methods we used a batch size of 200 
and the learning rate schedule p* = (r -f t)“”, where for k 
we used 0.5 and for r we used 100. 










A trust-region method for stochastic variational inference 


A 

Cluster centers 


o 2 1 1 ^ 3 % 

(nOr/SJHOSS 
35 ^ 3 / 37^^9 
9^1 9 2 O 1 S (o ^ 



B 


MNIST 



c 


Natural Image patches 


Q 



Figure 2. Mixture modeling results. A. Expected parameters, E[/3], under the mean-field solutions found via trust-region updates (top) 
and natural gradient steps (bottom). B. Lower bound averaged over multiple runs as a function of the number of updates of the param¬ 
eters. In case of the trust-region method, each step of the inner loop is counted as one update. C. Lower bound for 8 x 8 natural image 
patches sampled from the van Hateren dataset. The horizontal axis indicates the number of passes through the dataset. An update of the 
trust-region method takes twice as long as a natural gradient update. A comparison is made to memoized variational inference (MO) 
with batch sizes of 500 and 10,000. Error bars of TR and MO (B = 10k) are too small to be visible. 


Figure 2A shows the expected values of the probabilities 
under the posterior approximations found by the two 
methods. The solution found via natural gradient steps 
makes use of less than half of the mixture components, 
leading to significantly poorer performance (Figure 2B). 
This can be explained as follows. A mixture component 
which is inconsistent with the data will be assigned only a 
few data points. After updating 7 and A^, this mixture com¬ 
ponent will have lower prior probability and will shrink to¬ 
wards the prior distribution, which makes it less consistent 
with the data and in turn leads to the assignment of even 
fewer data points. The trust-region updates alleviate this 
problem by initializing (jj^k with 1 /K, that is, assigning the 
data points in the current batch equally to all mixture com¬ 
ponents. This forces the mixture components to initially 
become more consistent with the current batch of data. 

4.2. Natural image patches 

We next considered the task of modeling natural image 
patches sampled from the van Hateren image dataset (van 
Hateren & van der Schaaf, 1998). We used 8 x 8 im¬ 
age patches, which is a commonly used size in applica¬ 
tions such as image compression or restoration (e.g., Zo- 
ran & Weiss, 2011). We fit mixtures of Gaussians with 
50 components and normal-inverse-Wishart priors on the 
means and covariances of the components to 500,000 im¬ 
age patches. We used a batch size of 500, k = .5, r = 10, 
and a fixed number of 5 iterations for the inner loop of the 
trust-region method. For comparison we also trained mod¬ 
els with memoized variational inference (MO; Hughes & 
Sudderth, 2013) using the same settings for the priors and 
the same initialization of the distributions as with our meth¬ 


ods. MO is an online inference algorithm akin to gradient 
averaging. Its updates are less noisy than those of SVI but 
it requires memorizing each individual update, generating 
a considerable memory overhead when the dimensionality 
of the gradients or the number of batches is large. 

Figure 2 shows that SVI with natural gradient steps quickly 
gets stuck. MO with the same batch size performs slightly 
better, and MO with a large batch size of 10,000 does 
very well. Our trust-region method with a batch size of 
500 achieves comparable performance and uses much less 
memory. 

4.3. Wikipedia articles 

We applied our algorithm to latent Dirichlet allocation 
(LDA; Blei et al., 2003) and a dataset of approximately 
3.8M Wikipedia articles. The model is 

p(/3)oc^^^^ (^i) 

ki k 

p{x nm 5 Znm \ , /3) = ^TlZ-nm /^^nTnXnm • (22) 

Figure 3A shows the performance of our trust-region 
method plotted against natural gradient ascent for ran¬ 
domly selected hyperparameters. The parameters were se¬ 
lected from a grid with the batch size B and learning rate 
parameters selected from 

B € {10,50,100,500,1000}, 

K e 10.5,0.6,0.7,0.8,0.9,1.0}, 

T € { 1 , 10 , 100 , 1000 , 10000 }. 

For the trust-region method we used m steps in the inner 
loop of Algorithm 1 and M steps to compute (pn- For 
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Figure 3. Hyperparameter search results. The circled results’ hyperparameters are listed in Table 1 and Table 2. A. Performance on a 
validation set of Wikipedia articles. Shown are estimated log-likelihoods of the expected parameters found via the trust-region method 
and natural gradient ascent using randomly chosen hyperparameters. B. Performance on a validation set of Netfiix users. 


the corresponding natural gradient ascent we used mM/2 
steps to compute 0^. The two parameters were selected 
from (m, M) G {(5,40), (10, 20), (20,10), (5,100)}. We ran 
the trust-region method for 3 epochs and natural gradient 
ascent for 6 epochs. This gave both methods roughly the 
same amount of running time. The initial prior parameters 
were rj e {0.01, 0.05, 0.1,0.2,0.3} and € (0.01, 0.05, 0.1} 
but were updated via empirical Bayes. 

The two methods were evaluated on a separate validation 
set of 10,000 Wikipedia articles. We computed the ex¬ 
pected value of f3 under the variational distribution and es¬ 
timated the log-likelihood of the data given /3 using the 
Chib-style estimator described by Wallach et al. (2009). 
The trust-region method consistently outperformed natural 
gradient ascent. For 100 topics, the performance of natural 
gradient ascent varies widely while the trust-region method 
gave good results in almost all runs. In Table 1 we high¬ 
light the hyperparameters of some of the simulations. The 
table suggests that unlike natural gradient ascent, the trust- 
region method is able to perform well in particular with 
small batch sizes of as few as 50 data points. 

We next tested our streaming variant of SVI by simulating 
a data stream as follows. At each iteration of the simula¬ 
tion we took R documents from the Wikipedia dataset and 
added it to a set of active documents, where R was sampled 
from a Poisson distribution with a fixed rate. We then ob¬ 
served one word of each active document with probability 
p and added it to a database. This means that at any point 
in time, the database contained a number of complete and 
incomplete articles. Streaming SVI only had access to this 
database. To simulate the gradual accumulation of data we 
added a small delay after each observation. 

We compared our algorithm to streaming variational Bayes 


(SVB; Broderick et al., 2013) and the streaming batch al¬ 
gorithm described in Section 3.3. For a fair comparison, 
we extended SVB to also learn the a/, via empirical Bayes. 
Since in SVB rj is only used in the hrst update to A (Equa¬ 
tion 19), we did not estimate rj but instead ran SVB for 
p £ {0.01,0.05,0.1,0.2,0.3}. The performance we report 
is the maximum performance over all p. Since SVB can 
only process one document (or batch of documents) at a 
time but not one word at a time, SVB was given access to 
the entire document when the hrst word was added to the 
database, putting the other algorithms at a disadvantage. If 
we naively applied SVB to EDA on a word-by-word ba¬ 
sis, the beliefs about On would only be updated once after 
seeing the hrst word due to the mean-held approximation 
decoupling On from x„. Upon arrival of a subsequent word, 
the target distribution becomes 

p{0n,'Zn) OC q{On)q{Zn)p{Xn \ Zn), (23) 

SO that an update will not change q{0n). SVB would thus 
completely ignore dependencies between the topic assign¬ 
ments of different words. In our experiments, we used a 
batch size of 1,000 with SVB. The streaming batch algo¬ 
rithm was trained for 100 iterations on each randomly se¬ 
lected set of 50,000 data points. 

We tested the streaming algorithms on EDA with 100 and 
400 topics. Eor the hyperparameters of SVI we chose 
the best performing set of parameters found in the non¬ 
streaming setting (Eigure 3A). That is, for the trust-region 
method and natural gradient ascent we used hyperparame¬ 
ter sets 4 and 3, and 1 and 2, respectively (see Table 1). We 
only reduced k to 0.5 for 100 topics and to 0.4 for 400 top¬ 
ics, hnding that keeping the learning rate up for a longer 
time slightly improves performance, p and the of the 
streaming batch algorithm were initialized with the same 
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Figure 4. Performance of LDA on streamed Wikipedia articles. The horizontal axis shows the number of articles (complete and incom¬ 
plete) currently in the database (it is thus monotonically related to time). The vertical axis shows an estimate of the average log-likelihood 
on a test set separate from the validation set used in Figure 3. Dashed lines indicate the performance of SVI trained on the entire dataset 
and evaluated on the test set using the same hyperparameters as the streaming variants. 


parameters used with the trust-region method. 

We find that natural gradient ascent gives extremely poor 
performance in the streaming setting and gets stuck in a lo¬ 
cal optimum early on. Using our trust-region method, on 
the other hand, we were able to achieve a performance at 
least as good as the best performing natural gradient as¬ 
cent of the non-streaming setting (Figure 4). The simple 
streaming batch algorithm also performs well, although its 
performance is limited by the fixed batch size. 

SVB with a batch size of 1,000 performs much better than 
streaming SVI with natural gradient steps, but worse than 
our trust-region method or the streaming batch algorithm. 
We note that by using larger batches, e.g. of size 50,000, 
SVB can be made to perform at least as well in the limit 
as the streaming batch algorithm (assuming the data is sta¬ 
tionary, as is the case here). However, SVB would only be 
able to start training once the first 50,000 data points have 
arrived. The streaming batch algorithm, on the other hand, 
will simply take all data points available in the database. 

4.4. Netflix 

We applied LDA to the task of collaborative filtering us¬ 
ing the Netflix dataset of 480,000 user ratings on 17,000 
movies. Each rating is between 1 and 5. Like Gopalan 
et al. (2013), we thresholded the ratings and for each user 
kept only a list of movies with ratings of 4 or 5. For this 
“implicit” version of the Netflix data, the task is to predict 
which movies a user likes based on movies the user previ¬ 
ously liked. 20% of each user’s movies were kept for evalu¬ 
ation purposes and were not part of the training data. Again 
following Gopalan et al. (2013), we evaluate algorithms by 
computing normalizedprecision-at-M and recall-at-M. Let 
L be the test set of movies liked by a user and let P be a list 


Table 1. Hyperparameters corresponding to Wikipedia experi¬ 
ments in Figure 3A. B is the batch size and m and M control 
the number of iterations in the inner loops of each trust-region 
step (see text for details). 
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B 

a 
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T 

m 

M 

1 

500 

0.1 

0.2 

0.6 

10 

10 

20 

2 

1000 

0.1 

0.05 

0.7 

100 

20 

10 

3 

500 

0.1 

0.2 

0.7 

100 

10 

20 

4 

50 

0.1 

0.2 

0.7 

10 

20 

10 

5 

50 

0.1 

0.2 

0.7 

1 

20 

10 

6 

10 

0.1 

0.01 

0.5 

100 

10 

20 


Table 2. Hyperparameters corresponding to the highlighted Net¬ 
flix experiments in Figure 3B. 


# 

B 

a 

V 

K 

T 

m 

M 

1 

10 

0.1 

0.05 

1.0 

1 

5 

40 

2 

100 

0.05 

0.3 

0.5 

100 

5 

40 

3 

500 

0.1 

0.3 

0.5 

10 

10 

20 


of M predicted movies. Then the two measures are given 
by 

precision = ^^4) 

Precision is normalized by min{|L|, |P|} rather than |P| so 
as to keep users with fewer than M movies from having 
only a small influence on the performance (Gopalan et al., 
2013). In the following, we will use M = 20. 

LDA is trained on the Netflix data as in the case of 
Wikipedia articles, only that here users play the role of ar- 
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Figure 5. Performance of LDA trained on a stream of Netflix users. The horizontal axis gives the current number of users in the database. 


tides and movies play the role of words. For prediction, 
we approximated the predictive distribution over the next 
movie liked by user n, 

P{Xni I ^n,<i) ~ J2k^q[^nk]^q[l3kx„J, (25) 

where x„ corresponds to the 80% movies not used for 
evaluation. We then took the 20 most likely movies under 
this predictive distribution which were not already in x„_<i 
as predictions. 

Figure 3B shows results obtained with random hyperpa¬ 
rameters using the same grid used with the Wikipedia 
dataset. Evaluation was performed on a random subset of 
10,000 users. Interpretation of the results here is a little 
bit more challenging since we have to take into account 
two competing measures. Nevertheless, we And that for the 
better performing hyperparameter settings, the trust-region 
method again consistently outperforms natural gradient as¬ 
cent on both measures. Hyperparameters of the highlighted 
simulations are given in Table 2. 

The streamed Wikipedia articles allowed us to study the ef¬ 
fect of a growing dataset on SVI without having to worry 
about nonstationarities in the data. Here we study a slightly 
more realistic setting by streaming the users in the order 
they signed up for Netflix, thus potentially creating a non¬ 
stationary data stream. We tested LDA only with 100 top¬ 
ics since using more topics did not seem to help improve 
performance (Figure 3B). Figure 5 shows the results of the 
streaming experiment on a test set of 10,000 users different 
from the ones used in the parameter search. The hyperpa¬ 
rameters were again chosen based on the parameter search 
results (sets 2 and 3 in Table 2) but reducing k to 0.4. As in 
the Wikipedia experiment, we And that natural gradient as¬ 
cent performs poorly while the trust-region method is able 
to get to the same performance level as the best performing 
natural gradient ascent in the non-streaming setting. 


bilistic matrix factorization (PMF), and nonnegative matrix 
factorization (NMF) to show that our numbers are compet¬ 
itive^. 

5. Discussion 

We have proposed a new variant of stochastic variational 
inference which replaces the natural gradient step with a 
trust-region step. Our experiments show that this change 
can make SVI less prone to local optima and less sensitive 
to the choice of hyperparameters. We only explored an ap¬ 
plication of the trust-region method to mean-field approx¬ 
imations with conjugate priors. However, the same ideas 
might be applied in other settings, for example, by combin¬ 
ing them with other recent innovations in variational infer¬ 
ence such as structured SVI (Hoffman & Blei, 2014) and 
black box variational inference (Ranganath et al., 2014). 

We further described a simple strategy for applying SVI to 
streaming data and have shown that the trust-region updates 
are crucial for good performance in this setting. However, 
Figures 4 and 5 also reveal room for improvement as our 
streaming method does not yet reach the performance of 
the trust-region method applied to the full dataset. Since 
we used empirical Bayes to tune the parameters of the 
prior and the trust-region method’s performance is not very 
sensitive to the batch size, the only hyperparameters still 
requiring some tuning are the ones controlling the learn¬ 
ing rate schedule. Here we found that using larger learn¬ 
ing rates (k < 0.5) generally works well. Nevertheless, it 
would be desirable to And a fully automatic solution. We 
explored adapting the work of Ranganath et al. (2013) to 
our trust-region method but found that it did not work well 
in the streaming setting. 

^Gopalan et al. (2013) report worse results for LDA with SVI. 
This may be due to a suboptimal choice of hyperparameters. 


For comparison, we also include results of Gopalan et al. 
(2013) on hierarchical Poisson factorization (HPF), proba- 
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